The Online Loop-free Stochastic Shortest-Path Problem
Abstract
We consider a stochastic extension of the loop-free shortest path problem with adversarial rewards. In this episodic Markov decision problem an agent traverses an acyclic graph with random transitions: at each step of an episode the agent chooses an action, receives some reward, and arrives at a random next state, where the reward and the distribution of the next state depend on the current state and the chosen action. We consider the bandit setting, in which only the reward of the just visited state-action pair is revealed to the agent. For this problem we develop algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability α > 0 under all policies, we give an algorithm and prove that its regret is O(L^2 √(T|A|) / α), where T is the number of episodes, A denotes the (finite) set of actions, and L is the length of the longest path in the graph. Variants of the algorithm are given that improve the dependence on the transition probabilities under specific conditions. The results are also extended to variations of the problem, including the case when the agent competes with time-varying policies.
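To make the setting concrete, the sketch below illustrates the episodic loop-free structure and the bandit feedback described in the abstract. It is not the algorithm analysed in the paper; it simply runs an independent Exp3-style learner at every state as a baseline. The layered state space, the reward function, and the learning rate eta are assumptions made for this example.

```python
# Minimal sketch (NOT the paper's algorithm): episodic loop-free MDP with
# bandit feedback, handled by an independent Exp3-style learner per state.
import math
import random

L, A = 3, 2                                    # path length, number of actions (assumed)
states = [(layer, i) for layer in range(L) for i in range(2)]
weights = {x: [1.0] * A for x in states}       # Exp3 weights per state
eta = 0.1                                      # learning rate (assumed)

def next_state(layer, a):
    """Random transition into one of the two states of the next layer."""
    return (layer + 1, random.choice([0, 1]))

def reward(x, a, t):
    """Arbitrary (here random) per-episode reward in [0, 1] for (x, a)."""
    return random.random()

T = 1000                                       # number of episodes
for t in range(T):
    x = (0, 0)                                 # fixed start state
    while x[0] < L:                            # an episode has at most L steps
        probs = [w / sum(weights[x]) for w in weights[x]]
        a = random.choices(range(A), probs)[0]
        r = reward(x, a, t)                    # bandit feedback: only this reward is seen
        # importance-weighted reward estimate, as in Exp3
        weights[x][a] *= math.exp(eta * r / probs[a])
        m = max(weights[x])                    # rescale to avoid overflow
        weights[x] = [w / m for w in weights[x]]
        x = next_state(x[0], a)
```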
Similar papers
The adversarial stochastic shortest path problem with unknown transition probabilities
We consider online learning in a special class of episodic Markovian decision processes, namely, loop-free stochastic shortest path problems. In this problem, an agent has to traverse through a finite directed acyclic graph with random transitions while maximizing the obtained rewards along the way. We assume that the reward function can change arbitrarily between consecutive episodes, and is e...
Dynamic Multi Period Production Planning Problem with Semi Markovian Variable Cost (TECHNICAL NOTE)
This paper develops a method for solving the single product multi-period production-planning problem, in which the production and the inventory costs of each period are concave and backlogging is not permitted. It is also assumed that the unit variable cost of the production evolves according to a continuous time Markov process. We prove that this production-planning problem can be stated as a ...
Density estimation of shortest path lengths in spatial stochastic networks
We consider a spatial stochastic model for telecommunication networks, the stochastic subscriber line model, and we investigate the distribution of the typical shortest path length between network components. To this end, we derive a representation formula for the probability density of this distribution which is based on functionals of the so-called typical serving zone. Using this formula, we c...
Solving Stochastic Shortest-Path Problems with RTDP
We present a modification of the Real-Time Dynamic Programming (rtdp) algorithm that makes it a genuine off-line algorithm for solving Stochastic Shortest-Path problems. Also, a new domain-independent and admissible heuristic is presented for Stochastic Shortest-Path problems. The new algorithm and heuristic are compared with Value Iteration over benchmark problems with large state spaces. The r...
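As a point of reference for the RTDP variant described above, the following is a minimal sketch of plain value iteration on a small Stochastic Shortest-Path instance. The states, costs, and transition probabilities below are made up for illustration.

```python
# Minimal sketch: value iteration on a toy SSP (minimize expected cost to goal).
GOAL = "g"
# transitions[state][action] = list of (probability, next_state)
transitions = {
    "s0": {"a": [(0.8, "s1"), (0.2, "s0")], "b": [(1.0, "s2")]},
    "s1": {"a": [(1.0, GOAL)]},
    "s2": {"a": [(0.5, GOAL), (0.5, "s0")]},
}
cost = {("s0", "a"): 1.0, ("s0", "b"): 2.0, ("s1", "a"): 1.0, ("s2", "a"): 1.0}

V = {s: 0.0 for s in list(transitions) + [GOAL]}   # cost-to-go estimates
for _ in range(100):                               # bounded number of sweeps (assumed)
    delta = 0.0
    for s, actions in transitions.items():
        q = [cost[(s, a)] + sum(p * V[ns] for p, ns in outcomes)
             for a, outcomes in actions.items()]
        new_v = min(q)                             # Bellman backup for SSP
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-9:
        break
print(V)                                           # optimal expected cost-to-go per state
```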
An Online Convergent Q-learning Algorithm with Linear Function Approximation
We present in this article a variant of Q-learning with linear function approximation that is based on two-timescale stochastic approximation. Whereas it is difficult to prove convergence of regular Q-learning with linear function approximation because of the off-policy problem, we prove that our algorithm is convergent. Numerical results on a multi-stage stochastic shortest path problem show t...
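For orientation, the sketch below shows ordinary single-timescale Q-learning with a linear (here one-hot) feature representation on a toy chain MDP. It is not the two-timescale convergent algorithm described above; the environment, feature map, and step sizes are assumptions made for this example.

```python
# Minimal sketch: standard Q-learning with linear function approximation.
import random

N_STATES, ACTIONS = 5, [0, 1]                  # small chain MDP (assumed)
GAMMA, ALPHA, EPS = 0.95, 0.05, 0.1            # discount, step size, exploration (assumed)

def features(s, a):
    """One-hot feature vector over (state, action) pairs."""
    phi = [0.0] * (N_STATES * len(ACTIONS))
    phi[s * len(ACTIONS) + a] = 1.0
    return phi

def q_value(theta, s, a):
    return sum(t * f for t, f in zip(theta, features(s, a)))

def step(s, a):
    """Move right on action 1, left on action 0; reward 1 at the right end."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

theta = [0.0] * (N_STATES * len(ACTIONS))      # linear weights
s = 0
for _ in range(20000):
    a = random.choice(ACTIONS) if random.random() < EPS else \
        max(ACTIONS, key=lambda b: q_value(theta, s, b))
    s2, r = step(s, a)
    td_error = (r + GAMMA * max(q_value(theta, s2, b) for b in ACTIONS)
                - q_value(theta, s, a))
    theta = [t + ALPHA * td_error * f for t, f in zip(theta, features(s, a))]
    s = s2
```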